Abstract: With more and more complete genome sequences being released,
phylogenetic analysis is entering a new era, that of phylogenomics.
In this paper, a novel phylogenomic method, named Base-Base Correlation
(BBC), is proposed to infer phylogenetic relationships from complete
genomes, with particular emphasis on coronavirus phylogeny. Following the
high-profile SARS outbreaks, interest in coronaviruses has been renewed
and two novel human coronaviruses (NL63 and HKU1) have been identified.
The coronavirus phylogeny inferred with BBC is consistent with that of
previous studies. BBC, which describes the information structure of a
genome on the basis of information theory, provides a novel alignment-free
phylogenomic methodology for postgenome informatics.

1 Introduction

Until recently, traditional phylogeny was mainly based on comparisons of 16S small
ribosomal RNA (16S rRNA) sequences. Although such molecules are universally
distributed and evolutionarily conserved, mutational saturation is a problem
because of their restricted lengths (Henz et al., 2005; Moreira and Philippe, 2000).
Moreover, it has been shown that rRNA-based phylogeny can sometimes be grossly
misleading in inferring phylogenetic relationships in the presence of unequal rates of
evolution or differences in base composition (Philippe and Laurent, 1998). To overcome
these limitations, it is tempting to apply a genome-scale approach to phylogenetic
inference (phylogenomics). The rapidly increasing availability of complete genome
sequences has also prompted interest in using whole genome information to infer
phylogenetic relationships.

In addition, traditional phylogenetic methods rely on multiple sequence alignment.
However, when large genome sequences are analysed, traditional alignment methods
are time consuming. Moreover, in sequence alignment, insertions and deletions are
poorly evaluated because of the assumption of regular evolutionary models
(Grasso and Lee, 2004; Lee et al., 2002; Raphael et al., 2004).
Thus, there is a need for an efficient alignment-free way to convert whole genome
sequences into pertinent phylogenetic information.

Here we develop a novel alignment-free phylogenomic approach, named BBC, which is
inspired by the use of the Mutual Information Function (MIF) to analyse DNA
sequences. Compared with MIF, BBC emphasises the information carried by the
different base pairs within the range k. It improves the resolving power and provides a
more appropriate description of sequence dissimilarity (Liu et al., 2007). In this paper,
we present our study of applying BBC to phylogenetic inference, with particular
emphasis on coronavirus phylogeny.

Coronavirus is a genus of animal virus belonging to the family Coronaviridae.
Coronaviruses are enveloped viruses with a positive-sense, single-stranded RNA genome
and a helical symmetry. The genome size of coronaviruses ranges from approximately
16 to 31 kilobases, extraordinarily large for an RNA virus (Rota et al., 2003).
Coronaviruses can be divided into three groups according to serotype. Groups I
and II contain mammalian viruses; group II coronaviruses additionally contain a
hemagglutinin esterase gene homologous to that of Influenza C virus. Group III contains
only avian viruses. In 2003, a novel coronavirus was isolated and found to be the cause
of severe acute respiratory syndrome (SARS), an outbreak of which had begun the
previous year in Asia, with secondary cases elsewhere in the world. The virus was
officially named the SARS Coronavirus (SARS-CoV). For many years, scientists knew
of only two human coronaviruses (HCoV-229E and HCoV-OC43). The discovery of
SARS-CoV renewed interest in coronaviruses in the field of virology. By the end
of 2004, three independent research labs had reported the discovery of a fourth human
coronavirus (Hofmann et al., 2005). It has been named NL63, NL or the New Haven
coronavirus by the different research groups. Early in 2005, a research team at the
University of Hong Kong reported finding a fifth human coronavirus in two pneumonia
patients, and subsequently named it HKU1 (Woo et al., 2005).


2 Methods 

2.1 Materials

A total of 26 complete coronavirus genomes used in this study were retrieved from 
NCBI (http://www.ncbi.nlm.nih.gov/). The name, abbreviation, accession number, 
genome length, and the existing taxonomic groups for the 26 coronavirus genomes are 
shown in Table 1. 


Table 1 The name, abbreviation, accession number, genome length, and group for each
of the 26 genomes

No.  Genome                                   Abbreviation  Accession  Length (nt)  Group
1    Human coronavirus 229E                   HCoV-229E     NC_002645  27,317       I
2    Transmissible gastroenteritis virus      TGEV          NC_002306  28,586       I
3    Porcine epidemic diarrhea virus          PEDV          NC_003436  28,033       I
4    Bovine coronavirus strain Mebus          BCoVM         U00735     31,032       II
5    Bovine coronavirus isolate BCoV-LUN      BCoVL         AF391542   31,028       II
6    Bovine coronavirus strain Quebec         BCoVQ         AF220295   31,100       II
7    Bovine coronavirus                       BCoV          NC_003045  31,028       II
8    Murine hepatitis virus strain ML-10      MHVM          AF208067   31,233       II
9    Murine hepatitis virus strain 2          MHV2          AF201929   31,276       II
10   Murine hepatitis virus strain Penn 97-1  MHVP          AF208066   31,112       II
11   Murine hepatitis virus                   MHV           NC_001846  31,357       II
12   Avian infectious bronchitis virus        IBV           NC_001451  27,608       III
13   SARS coronavirus BJ01                    BJ01          AY278488   29,725       IV
14   SARS coronavirus Urbani                  Urbani        AY278741   29,727       IV
15   SARS coronavirus HKU-39849               HKU-39849     AY278491   29,742       IV
16   SARS coronavirus CUHK-W1                 CUHK-W1       AY278554   29,736       IV
17   SARS coronavirus CUHK-Su10               CUHK-Su10     AY282752   29,736       IV
18   SARS coronavirus Sin2500                 SIN2500       AY283794   29,711       IV
19   SARS coronavirus Sin2677                 SIN2677       AY283795   29,705       IV
20   SARS coronavirus Sin2679                 SIN2679       AY283796   29,711       IV
21   SARS coronavirus Sin2748                 SIN2748       AY283797   29,706       IV
22   SARS coronavirus Sin2774                 SIN2774       AY283798   29,711       IV
23   SARS coronavirus TW1                     TW1           AY291451   29,729       IV
24   SARS coronavirus                         TOR2          NC_004718  29,751       IV
25   Human coronavirus NL63                   NL63          NC_005831  27,553       I
26   Human coronavirus HKU1                   HKU1          NC_006577  29,926       II


2.2 Base-Base Correlation (BBC) 


DNA sequences can be viewed as symbolic strings composed of the four letters
$(B_1, B_2, B_3, B_4) = (A, C, G, T)$. The probability of finding the base $B_i$ is
denoted by $P_i$ ($i = 1, 2, 3, 4$). BBC is then defined as follows:

$$T_{ij}(k) = \sum_{l=1}^{k} P_{ij}(l) \log_2 \frac{P_{ij}(l)}{P_i P_j}, \quad i, j \in \{1, 2, 3, 4\}. \qquad (1)$$

Here, $P_{ij}(l)$ denotes the joint probability of bases $i$ and $j$ at a distance of $l$.
$T_{ij}(k)$ represents the average relevance of the two-base combination over gaps
from 1 to $k$. It reflects a local feature of two bases within the range $k$. For each
genome sequence $m$, BBC has 16 parameters, which constitute a 16-dimensional vector
$V_z^m$ ($z = 1, 2, \ldots, 16$).

Let $L$ be the length of a whole genome sequence ($1 < k < L$). As $k$ approaches $L$,
$T_{ij}(k)$ covers the base-pair information of the whole genome sequence.
Theoretically, BBC extracts more complete genome information when $k$ is larger.
However, we found that BBC shows no considerable change when $k > 147$
(Liu et al., 2007). The biological significance of this value of $k$ may be related to the
fact that nucleosomal DNA contains a core DNA region with a stable length of 147 bp,
which is relatively resistant to digestion by nucleases. We therefore take $k = 147$ in the
BBC calculation for the genome sequences in the present study.

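Statistical independence of two bases at a distance $l$ is defined by
$P_{ij}(l) = P_i P_j$. Thus, the deviation from statistical independence is defined by

$$D_{ij}(l) = P_{ij}(l) - P_i P_j. \qquad (2)$$

We expand $T_{ij}(k)$ using a Taylor series in terms of equation (2):

$$
\begin{aligned}
T_{ij}(k) &= \sum_{l=1}^{k} P_{ij}(l) \log_2 \frac{P_{ij}(l)}{P_i P_j} \\
&= \frac{1}{\ln 2} \sum_{l=1}^{k} \left[ D_{ij}(l) + P_i P_j \right] \ln\left( 1 + \frac{D_{ij}(l)}{P_i P_j} \right) \\
&= \frac{1}{\ln 2} \sum_{l=1}^{k} \left[ D_{ij}(l) + P_i P_j \right] \left[ \frac{D_{ij}(l)}{P_i P_j} - \frac{D_{ij}^2(l)}{2 P_i^2 P_j^2} + \cdots \right] \\
&= \frac{1}{\ln 2} \sum_{l=1}^{k} \left[ D_{ij}(l) + \frac{D_{ij}^2(l)}{2 P_i P_j} + o\left( D_{ij}^2(l) \right) \right]. \qquad (3)
\end{aligned}
$$

This mathematical transformation speeds up the calculation and effectively avoids the
problem of evaluating $0 \cdot \log_2 0$ (when $P_{ij}(l) = 0$ in equation (1)).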
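
As an illustration of the BBC definition and its Taylor expansion, the following minimal Python sketch computes the 16 BBC parameters of a sequence. It is not the authors' implementation. The function name `bbc_vector`, the estimate of the joint probability as pair counts divided by the number of position pairs at each gap, the rejection of symbols other than A, C, G and T, and the AA, AC, ..., TT ordering of the output are all assumptions made for this example. With `approximate=True` the sketch uses the second-order expansion of equation (1) about D = 0, that is, D + D^2 / (2 Pi Pj) divided by ln 2.

```python
import math

BASES = "ACGT"

def bbc_vector(seq, k=147, approximate=False):
    """Return the 16 values T_ij(k), ordered AA, AC, ..., TT (assumed order)."""
    seq = seq.upper()
    if any(b not in BASES for b in seq):
        # Assumption: ambiguous symbols are rejected rather than skipped.
        raise ValueError("sequence may only contain A, C, G, T")
    n = len(seq)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(seq)")
    idx = [BASES.index(b) for b in seq]
    p = [idx.count(i) / n for i in range(4)]      # base frequencies P_i
    t = [[0.0] * 4 for _ in range(4)]
    for gap in range(1, k + 1):
        counts = [[0] * 4 for _ in range(4)]
        for a, b in zip(idx, idx[gap:]):          # all pairs at distance `gap`
            counts[a][b] += 1
        pairs = n - gap
        for i in range(4):
            for j in range(4):
                pipj = p[i] * p[j]
                if pipj == 0:
                    continue                      # base absent from the sequence
                pij = counts[i][j] / pairs        # estimated P_ij(gap)
                if approximate:
                    d = pij - pipj                # D_ij(gap), equation (2)
                    t[i][j] += (d + d * d / (2 * pipj)) / math.log(2)
                elif pij > 0:                     # a zero P_ij contributes nothing
                    t[i][j] += pij * math.log2(pij / pipj)
    return [t[i][j] for i in range(4) for j in range(4)]
```

Calling `bbc_vector(genome)` with the default `k=147` follows the choice of k made in this study; for sequences in which the deviations D are small, the exact and approximate modes give nearly identical vectors.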


2.3 The distance matrix 


Given two sequences $m$ and $n$, the distance $H_{mn}$ between them is defined as the
Euclidean distance between their BBC vectors:

$$H_{mn} = \sqrt{\sum_{z=1}^{16} \left( V_z^m - V_z^n \right)^2}, \quad m, n = 1, 2, \ldots, N. \qquad (4)$$

Here, $V^m$ and $V^n$ represent the 16-dimensional vectors of sequences $m$ and $n$, and
$N$ is the total number of sequences analysed. According to equation (4), $H_{mn}$
satisfies the definition of a distance: (a) $H_{mn} > 0$ for $m \neq n$; (b) $H_{mm} = 0$;
(c) $H_{mn} = H_{nm}$ (symmetry); (d) $H_{mn} \le H_{mq} + H_{nq}$ (triangle inequality).
For $N$ sequences, a real symmetric $N \times N$ distance matrix is then obtained.
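
The construction of the distance matrix can be sketched as follows, assuming that equation (4) is the Euclidean distance between two 16-dimensional BBC vectors. The function name `bbc_distance_matrix` is hypothetical and the input is any list of equal-length numeric vectors.

```python
import math

def bbc_distance_matrix(vectors):
    """Return the symmetric N x N matrix of pairwise Euclidean distances."""
    n = len(vectors)
    h = [[0.0] * n for _ in range(n)]             # H_mm = 0 on the diagonal
    for m in range(n):
        for q in range(m + 1, n):
            if len(vectors[m]) != len(vectors[q]):
                raise ValueError("all vectors must have the same length")
            dist = math.sqrt(sum((x - y) ** 2
                                 for x, y in zip(vectors[m], vectors[q])))
            h[m][q] = h[q][m] = dist              # H_mn = H_nm
    return h
```

Only the upper triangle is computed and then mirrored, so the symmetry property holds by construction.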


2.4 Clustering 

Accordingly, a real symmetric N × N matrix is used to reflect the evolutionary distance
between N sequences. The clustering tree is then constructed using the
neighbour-joining method. The reliability of the branches is assessed by performing
100 resamplings. Bootstrap values are shown on nodes.
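
For readers unfamiliar with the clustering step, the following is a minimal sketch of the standard neighbour-joining procedure applied to a distance matrix. It is an illustration only and not the software used in this study: the function name `neighbour_joining` and the nested-tuple tree representation are assumptions, the last remaining edge is split at its midpoint purely for convenience, and the resampling step is not covered.

```python
def neighbour_joining(names, dist):
    """Build a tree as nested ((subtree, branch_length), (subtree, branch_length))."""
    n = len(names)
    if n < 2:
        raise ValueError("need at least two sequences")
    trees = {i: names[i] for i in range(n)}       # active nodes
    d = {i: {j: float(dist[i][j]) for j in range(n) if j != i}
         for i in range(n)}
    next_id = n
    while len(trees) > 2:
        ids = list(trees)
        m = len(ids)
        r = {i: sum(d[i].values()) for i in ids}  # row sums over active nodes
        pairs = ((a, b) for x, a in enumerate(ids) for b in ids[x + 1:])
        # Join the pair that minimises Q(a, b) = (m - 2) d(a, b) - r(a) - r(b).
        a, b = min(pairs, key=lambda ab: (m - 2) * d[ab[0]][ab[1]]
                   - r[ab[0]] - r[ab[1]])
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (m - 2))
        len_b = d[a][b] - len_a
        new = next_id
        next_id += 1
        d[new] = {}
        for c in ids:
            if c in (a, b):
                continue
            # Distance from the new internal node to every other active node.
            d_new = 0.5 * (d[a][c] + d[b][c] - d[a][b])
            d[new][c] = d[c][new] = d_new
            del d[c][a], d[c][b]
        del d[a], d[b]
        trees[new] = ((trees.pop(a), len_a), (trees.pop(b), len_b))
    a, b = trees
    half = d[a][b] / 2                            # arbitrary midpoint of last edge
    return ((trees[a], half), (trees[b], half))
```

For distances that are not perfectly tree-like, neighbour joining can return negative branch lengths; this sketch does not correct them.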


3 Results 

3.1 GC content of 26 coronavirus genomes 

The GC content of each of the 26 coronavirus genomes is shown in Figure 1. The GC
content of every coronavirus is below 0.5. The GC content of the 12 SARS-CoVs remains
relatively stable at 0.4. NL63 and HKU1, which were identified after the outbreak of
SARS, are two novel human coronaviruses with GC content below 0.35.
The GC content of HKU1 is 0.32, the lowest among all known coronaviruses.
Previous studies have revealed a statistical relationship between gene density and GC
content, and genome sequences with low GC content were also found to correlate
with long intron length and a high LINE repeat density (Versteeg et al., 2003).
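
GC content is the fraction of G and C bases in a sequence. A minimal sketch of the calculation is given below; the function name `gc_content` is hypothetical, and ignoring symbols other than A, C, G, T and U in the denominator is an assumption of this example.

```python
def gc_content(seq):
    """Return the fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    # Assumption: ambiguous symbols such as N are left out of the total.
    total = gc + seq.count("A") + seq.count("T") + seq.count("U")
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / total
```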

Figure 1 GC content for each of 26 coronavirus genomes 

3.2 BBC curves of 26 coronavirus genomes

For each genome sequence, the 16 BBC parameters are calculated and joined into a
continuous curve, which is designated the BBC curve. The BBC curve then serves
as a unique feature of a given genome sequence, providing an intuitive and general
description of that sequence.

The BBC curves of the 26 coronavirus genome sequences are displayed in Figure 2.
Each curve represents a full-length coronavirus genome. The BBC curves of the
SARS-CoVs (genome Nos. 13-24) are distinct from those of the other coronaviruses.


Figure 2 BBC curves of 26 coronavirus genomes 



3.3 The distance matrix of 26 coronavirus genomes 

Figure 3 shows the distance matrix for the 26 coronavirus genomes. This figure has
two interesting features. First, a clear block structure indicates that the coronaviruses
are divided into four groups. The blocks of the SARS-CoVs (genome Nos. 13-24) are
markedly different from the other blocks. Second, the blocks of NL63 and HKU1
(genome Nos. 25 and 26), identified as two novel human coronaviruses, are also distinct
from the blocks of the SARS-CoVs.

3.4 Coronavirus phylogeny based on Base-Base Correlation (BBC)

As shown in Figure 4, four groups of coronavirus can be seen in the phylogram.
The SARS-CoVs cluster together and form a separate branch, which can be
distinguished easily from the other three groups of coronavirus. NL63 and HCoV-229E
tend to cluster together. PEDV and TGEV join them to form group I. In another branch,
the group II coronaviruses, comprising three subgroups (Bovine coronavirus, Murine
hepatitis virus and Human coronavirus HKU1), tend to cluster together. Moreover,
groups I and II, which are all mammalian viruses, cluster together to form a larger
group. IBV, belonging to group III, is situated on an independent branch. The resulting
monophyletic clusters agree perfectly with the established taxonomic groups. Our results
also show that NL63 and HKU1 belong to groups I and II, respectively.

Figure 3 Density plot of the distance matrix for 26 coronavirus genomes

Figure 4 Coronavirus phylogeny based on base-base correlation


4 Discussion

In the present study, a novel algorithm based on BBC is proposed and applied to
coronavirus phylogeny. The phylogenetic tree constructed by the BBC algorithm
agrees well with those of previous studies.

Previous phylogenetic inference has been based on multiple sequence alignment.
However, a global multiple alignment of whole genome sequences is time consuming.
The BBC vectors of the 26 coronavirus genomes were calculated within a few seconds
on a regular PC, whereas multiple sequence alignment of the same 26 genome sequences
took a few hours using ClustalX on the same PC.

In addition, most tools for multiple sequence alignment need extra operations such as
“exclude positions with gaps” and “correct for multiple substitutions” before constructing
trees (Chenna et al., 2003; Thompson et al., 1997; Jeanmougin et al., 1998).
These operations may throw away the most ambiguous parts of the alignment and
underestimate actual evolutionary distances. Especially for sequences with very large
divergence, the evolutionary distance cannot be reliably corrected by these alignment
tools (Chenna et al., 2003; Thompson et al., 1997; Jeanmougin et al., 1998). BBC, as an
alignment-free method, can overcome this limitation. A nucleotide sequence, regardless
of whether its length is in kilobases, megabases, or even gigabases, corresponds to a
unique 16-dimensional vector. The procedure is effectively a normalisation that allows
genomes of different scales, for which a good sequence alignment is difficult to obtain,
to be compared. Changes in the values of the 16 parameters reflect differences in genome
length and content. It is usually thought that higher sequence similarity represents a
closer genetic relationship between virus strains. This implies that BBC vectors tend to
be more similar when virus strains are more closely related. The evolutionary distance
matrix is obtained by arithmetic operations on these 16-dimensional vectors and is
then used to construct the phylogenetic tree.

Moreover, most phylogenetic analyses are based on particular genes or conserved
fragments, because these regions tend to be more evolutionarily conserved. But analyses
based on different parts of the genome may lead to different phylogenetic inferences.
It is therefore valuable to develop methods of whole genome phylogeny to
overcome these biases. As a phylogenomic method, BBC has been applied to whole
genome analysis (Liu and Sun, 2007; Liu et al., 2008). BBC considers the
full-length genome sequence as a whole, including coding and noncoding regions. The
latter are associated with biological functions and may play an important role in virus
evolution. A former study found that BBC differed significantly between coding regions
and noncoding regions (Liu et al., 2005). Phylogenetic analysis based on BBC, which
considers whole genome information including coding and noncoding regions, is
therefore likely to be more objective.

In addition, several genome-wide phylogenetic methods, such as gene order
(Boore and Brown, 1998) and gene content (Snel et al., 1999; Huson and Steel, 2004),
require genes to be identified first. However, gene identification is a time-consuming
procedure. The BBC method does not require gene identification or any human
intervention.


5 Conclusions 

With the fast development of worldwide genome sequencing projects, more and more
completely sequenced genomes are becoming available. However, traditional sequence
alignment tools and regular evolutionary models cannot readily deal with large-scale
genome sequences. In the present study, a novel phylogenomic method, named BBC,
is proposed. We applied BBC to coronavirus phylogeny. The result is consistent
with that of previous analyses. BBC, which is not limited to coronavirus phylogeny,
provides a fast and intuitive tool for whole genome sequence comparison. Being
based on information theory, it offers a new alignment-free phylogenomic
methodology for postgenome informatics.